Finding the beats in a piece of music is an inherently human task. When listening to a song, we instinctively tap our foot or nod our head to the rhythm without thinking much about it. For a computer, however, deciding exactly when to tap is a highly non-trivial task.
Many algorithms already exist for this, with varying degrees of success. They usually begin by estimating the onsets of the audio sample, i.e. the times at which a note is most likely to start, and then produce a beat track that best fits those onsets (a popular algorithm for this step uses dynamic programming). A good implementation of this strategy is found in the LibROSA package, as we will demonstrate below.
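As a rough sketch of the dynamic-programming idea (a toy version, not librosa's actual implementation, and `dp_beat_track` is a hypothetical name): each frame's score is its onset strength plus the best score of a frame roughly one beat period earlier, with a penalty for deviating from that period; the beat track is then recovered by backtracing.

```python
import numpy as np

def dp_beat_track(onset_env, period, alpha=100.0):
    """Toy dynamic-programming beat tracker (in the spirit of Ellis, 2007).

    onset_env : 1-D onset strength envelope (one value per frame)
    period    : expected beat period, in frames
    alpha     : weight of the tempo-deviation penalty
    """
    n = len(onset_env)
    score = np.copy(onset_env).astype(float)
    backlink = np.full(n, -1)
    for t in range(n):
        # candidate previous beats: frames within [t - 2*period, t - period/2]
        lo, hi = max(0, t - 2 * period), max(0, t - period // 2)
        if hi <= lo:
            continue
        prev = np.arange(lo, hi)
        # penalize the log-deviation of the gap from the ideal period
        penalty = -alpha * np.log((t - prev) / period) ** 2
        total = score[prev] + penalty
        best = int(np.argmax(total))
        if total[best] > 0:
            score[t] += total[best]
            backlink[t] = prev[best]
    # backtrace from the best-scoring frame
    beats = [int(np.argmax(score))]
    while backlink[beats[-1]] >= 0:
        beats.append(int(backlink[beats[-1]]))
    return beats[::-1]
```

On a clean, strictly periodic envelope this recovers the peaks exactly; the interesting behaviour is the trade-off `alpha` controls between following strong onsets and keeping a steady tempo.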
The problem with this approach is that in many music genres, such as jazz, many of the most accentuated onsets are not on the beats. This fools the algorithm, which then produces all sorts of undesirable results.
For example, let's see how it does on a small excerpt of a classic song by jazz pianist Keith Jarrett (Bye Bye Blackbird).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import IPython.display
from PIL import Image
import torch
import beatfinder # This project
sr = beatfinder.constants.sr
hl = beatfinder.constants.hl
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
totensor = beatfinder.model.ToTensor(device)
print(f'Device: {device}')
Let's listen to the excerpt:
offset = 9.4
duration = 10
wav, _ = librosa.load('./data/raw-datasets/SELFMADE/audio/song1.m4a',
                      offset=offset, duration=duration, sr=sr)
IPython.display.Audio(wav, rate=sr)
To predict where the beats are, librosa first estimates where the notes begin (the onsets) and then fits a beat track on those onsets, as shown below:
onset_env = librosa.onset.onset_strength(y=wav, sr=sr, hop_length=hl)[1:]
times = librosa.frames_to_time(np.arange(len(onset_env)), sr=sr, hop_length=hl)
onset_env -= onset_env.min()
onset_env /= onset_env.max()
bpm, librosa_beats = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr, hop_length=hl, units='time')
fig, axes = plt.subplots(3, 1)
fig.set_size_inches(16, 8)
fig.subplots_adjust(hspace=0.75)
axes[0].set_title('raw audio')
axes[0].plot(librosa.samples_to_time(np.arange(len(wav)), sr=sr), wav)
axes[0].set_xlim(0, 10)
axes[0].set_xlabel('Time [sec]')
axes[1].set_title('onset envelope (likelihood that a note begins at each time)')
axes[1].plot(times, onset_env)
axes[1].set_xlim(0, 10)
axes[1].set_ylim(0, 2)
axes[2].set_title('librosa\'s beats prediction')
axes[2].plot(times, onset_env)
axes[2].vlines(librosa_beats, 0, 1.5, color='r', alpha=0.8, linestyles='--')
axes[2].set_xlim(0, 10)
axes[2].set_ylim(0, 2);
Let's compare with the ground truth beats:
gt = np.loadtxt('./data/raw-datasets/SELFMADE/beats/song1_excerpt.beats')
ground_truth_beats = gt[(offset < gt) & (gt < offset + duration)] - offset
plt.figure(figsize=(16, 2))
plt.title('beats: librosa vs ground truth')
plt.plot(times, onset_env)
plt.vlines(ground_truth_beats, 0, 1.5, color='g', alpha=0.5, linestyles='--', label='ground truth beats')
plt.vlines(librosa_beats, 0, 1.2, color='r', alpha=0.8, linestyles='--', label='librosa\'s beats')
plt.xlim(0, 10)
plt.ylim(0, 1.5)
plt.legend();
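Rather than eyeballing the plot, we can quantify the agreement by counting predicted beats that fall within a small tolerance window (±70 ms is a common choice in the beat-tracking literature) of a ground-truth beat. This is a simplified sketch of the standard F-measure; evaluation libraries such as mir_eval implement the real thing.

```python
import numpy as np

def beat_f_measure(predicted, reference, tol=0.07):
    """F-measure between two beat lists (times in seconds).

    A predicted beat counts as a hit if it lies within `tol` seconds of a
    still-unmatched reference beat (greedy one-to-one matching).
    """
    predicted, reference = np.sort(predicted), np.sort(reference)
    if len(predicted) == 0 or len(reference) == 0:
        return 0.0
    matched = np.zeros(len(reference), dtype=bool)
    hits = 0
    for p in predicted:
        dists = np.abs(reference - p)
        dists[matched] = np.inf  # each reference beat can be matched once
        j = int(np.argmin(dists))
        if dists[j] <= tol:
            matched[j] = True
            hits += 1
    if hits == 0:
        return 0.0
    precision, recall = hits / len(predicted), hits / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For instance, `beat_f_measure(librosa_beats, ground_truth_beats)` gives a single score between 0 and 1 for the prediction above.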
Okay, it got about two beats right (the 1st and 6th). The problem is that, as is common in jazz, the most emphasised notes often do not fall on the beats. Let's listen to both versions:
# Librosa's prediction
clicks_lb = librosa.clicks(times=librosa_beats, sr=sr, hop_length=hl, length=len(wav))
IPython.display.Audio(wav + clicks_lb * 0.3, rate=sr)
# Ground truth
clicks_gt = librosa.clicks(times=ground_truth_beats, sr=sr, hop_length=hl, length=len(wav))
IPython.display.Audio(wav + clicks_gt * 0.3, rate=sr)
The goal of this project is to explore the use of deep learning to improve this scheme. Rather than computing the beat track on the full onset envelope, we train a neural network to select only the onsets that are most likely to be beats.
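Note that in the demonstration below, the selection is an oracle: `beatfinder.utils.select_onsets` is given the ground-truth beats, which is how the training labels are produced (at inference time, the neural network makes this choice instead). One plausible implementation of such an oracle selection, which may differ from the actual `beatfinder` code, simply keeps the onsets lying close enough to a ground-truth beat:

```python
import numpy as np

def select_onsets_oracle(onset_times, beat_times, tol=0.05):
    """Hypothetical stand-in for beatfinder.utils.select_onsets: return the
    indices of onsets within `tol` seconds of a ground-truth beat. During
    training these indices serve as the positive labels (onset is a beat)."""
    onset_times = np.asarray(onset_times, dtype=float)
    beat_times = np.asarray(beat_times, dtype=float)
    # distance from each onset to its nearest ground-truth beat
    dist = np.min(np.abs(onset_times[:, None] - beat_times[None, :]), axis=1)
    return np.flatnonzero(dist <= tol)
```

The tolerance of 50 ms is an assumption; the right value depends on how precisely the ground-truth beats were annotated.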
onsets = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr, hop_length=hl)
onsets_times = librosa.frames_to_time(onsets, sr=sr, hop_length=hl)
idxs = beatfinder.utils.select_onsets(onsets_times, ground_truth_beats)
beats_predicted, _ = beatfinder.utils.beat_track(onsets[idxs])
fig = plt.figure(figsize=(16, 16))
fig.subplots_adjust(hspace=0.5)
plt.subplot(5, 1, 1)
plt.title('First compute the onset envelope with librosa.')
plt.plot(times, onset_env, label='original onset envelope')
plt.xlim(0, 10)
plt.ylim(0, 2)
plt.legend();
plt.subplot(5, 1, 2)
plt.title('Then, detect the peaks to get a list of onsets.')
plt.plot(times, onset_env)
plt.vlines(onsets_times, 0, 1.5, color='k', alpha=0.3, linestyles='--', label='onsets')
plt.xlim(0, 10)
plt.ylim(0, 2)
plt.legend();
plt.subplot(5, 1, 3)
plt.title('Use machine learning to do a binary classification and select the onsets that are most likely to be beats.')
plt.plot(times, onset_env)
plt.vlines(onsets_times, 0, 1.5, color='k', alpha=0.3, linestyles='--')
plt.vlines(onsets_times[idxs], 0, 1.5, color='m', alpha=1, linestyles='--', label='onsets selected by a NN as potential beats')
plt.xlim(0, 10)
plt.ylim(0, 2)
plt.legend();
plt.subplot(5, 1, 4)
plt.title('Use those selected onsets to generate a new onset envelope.')
new_onset_env = np.zeros_like(onset_env)
new_onset_env[onsets[idxs]] = 1
plt.plot(times, new_onset_env, label='new onset envelope')
plt.xlim(0, 10)
plt.ylim(0, 2)
plt.legend();
plt.subplot(5, 1, 5)
plt.title('Generate a beat track with librosa again, but this time using the new onset envelope.')
plt.plot(times, new_onset_env)
plt.vlines(ground_truth_beats, 0, 2, color='g', alpha=0.5, linestyles='--', label='ground truth beats')
plt.vlines(beats_predicted, 0, 1.5, color='r', alpha=0.8, linestyles='--', label='librosa after the NN preselection')
plt.xlim(0, 10)
plt.ylim(0, 2)
plt.legend();
This new prediction is much closer to the ground truth. In fact, most listeners would not notice the difference; this would count as a correct answer.
We can now formulate our problem:
Given a music excerpt and a list of onsets, use machine learning to select which of those onsets are more likely to be beats.
The first step is to find a good representation of the music. The most natural choice is a spectrogram:
n_mels = 256
fmax = 2**14
n_fft = 2**12
spec = librosa.feature.melspectrogram(y=wav, sr=sr, hop_length=hl, n_mels=n_mels, fmax=fmax, n_fft=n_fft)**2
spec = librosa.power_to_db(spec)
time = librosa.frames_to_time(np.arange(spec.shape[1]), sr=sr, hop_length=hl)
freq = librosa.mel_frequencies(n_mels=n_mels)[:n_mels//2]
plt.figure(figsize=(16, 6))
plt.pcolormesh(time, freq, spec[:n_mels//2, :])
plt.xlabel('Time [sec]')
plt.ylabel('Frequency [Hz]')
plt.title('Spectrogram');
Spectrograms more or less correspond to how humans perceive sound. Time runs along the x-axis and frequency (pitch) along the y-axis, so at any given instant we can read off which frequencies are heard. We can see the notes Keith Jarrett plays on the piano at around 500 Hz, as well as Gary Peacock's bass at the bottom.
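The mel scale used for the frequency axis is a perceptual pitch scale: roughly linear below about 1 kHz and logarithmic above. For reference, here is the HTK variant of the conversion (librosa's `mel_frequencies` defaults to the slightly different Slaney formula):

```python
import numpy as np

def hz_to_mel(f):
    """HTK mel-scale formula: linear-ish below ~1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```

Equally spaced points on the mel axis thus pack more resolution into the low frequencies, where the bass and the piano's fundamentals live.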
Since a spectrogram is an image, we could feed it into a standard 2D convolutional stack, but that did not yield good results here. Perhaps spotting animal faces in spectrograms is not so useful after all.

(Image generated with https://github.com/L1aoXingyu/Deep-Dream.)
More seriously, an audio signal is inherently a time series, and we get better results by treating it as one. We will therefore view spectrograms as sequences of intensity/frequency curves rather than as 2D images, and use recurrent neural networks.
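As a sketch of this idea (a hypothetical architecture, not the actual beatfinder model), a bidirectional GRU can read the spectrogram one frequency curve per time step and emit a per-frame beat logit:

```python
import torch
import torch.nn as nn

class BeatRNN(nn.Module):
    """Hypothetical sketch: a bidirectional GRU reads the spectrogram as a
    sequence of frequency vectors and scores each frame (beat / not beat)."""

    def __init__(self, n_mels=128, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_mels, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # one logit per frame

    def forward(self, spec):
        # spec: (batch, time, n_mels) -- one intensity/frequency curve per step
        out, _ = self.rnn(spec)
        return self.head(out).squeeze(-1)  # (batch, time)

model = BeatRNN(n_mels=128)
dummy = torch.randn(2, 431, 128)  # ~10 s of audio at sr=22050, hop=512
logits = model(dummy)
```

Training would then use a binary cross-entropy loss on the frames that contain onsets, with labels derived from the ground-truth beats as described above.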
fig, ax = plt.subplots()
fig.set_size_inches(16, 4)
ax.axis([0, freq[-1], spec.min(), spec.max()])
ax.set_title('Spectrogram as a sequence of intensity/frequency curves')
ax.set_xlabel('Frequency [Hz]')
ax.set_ylabel('Intensity [dB]')
line, = ax.plot([], [])
line.set_color('g')
def init():
    line.set_data([], [])
    return line,

def animate(i):
    line.set_data(freq, spec[:n_mels//2, 2 * i])
    return line,

anim = animation.FuncAnimation(fig,
                               animate,
                               init_func=init,
                               frames=spec.shape[1] // 2,
                               interval=2 * 1000 * hl / sr,
                               blit=True);
plt.close(fig)
IPython.display.HTML(anim.to_jshtml())